## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
## [1] 4898 13
The white wine quality dataset consists of data from 4898 samples of the
Portuguese Vinho Verde white wine. There are 11 input variables, based on
physicochemical tests, and one output variable, “quality”, based on grades given
by wine experts.
The goal of this investigation is to find out whether there are related variables
in this dataset and which of the physicochemical variables have a relationship
with the dependent variable quality.
Being only a wine drinker and not a wine expert, I decided to read more about
the chemical composition of white wine first in order to understand what kind
of data is present in the dataset. Please refer to the readme-document for a
list of sources.
Acids greatly contribute to the taste of wine (source: Waterhouse Lab). However,
volatile acidity is undesirable and should stay below 1.2 g/dm3 (source: Winefolly).
pH is a measure of active acidity: the lower the pH, the higher the acidity
and vice versa.
Furthermore, for each type of wine there is an optimal range of alcohol
percentage; for Vinho Verde this range is 8% to 11.5% (source: Wikipedia).
The same is more or less true for residual sugar level: sugar is also related
to the type of wine, so it would be interesting to see if there is an “optimal”
level of residual sugar for good quality Vinho Verde wines.
Although some sulfites are produced by the alcohol fermentation process,
sulfur dioxide (SO2) is usually added to wine as a preservative
(source: Winobrothers).
The total amount of sulfur dioxide is the sum of the amount of free sulfur
dioxide and bound sulfur dioxide. In the dataset, only the total and free
amount of sulfur dioxide are present. In the description of the dataset, it is
stated that concentrations of free SO2 of 50 ppm and higher become evident in
nose and taste. Therefore, it would be interesting to see if higher values
(> 50 ppm) for free SO2 result in lower quality ratings. In the dataset, there
is another related variable, “sulphates”. This is potassium sulphate, a wine
additive which can contribute to sulfur dioxide gas (SO2) levels.
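The relation between free, bound and total sulfur dioxide can be illustrated with a small sketch. It uses the two sulfur dioxide columns of the first six rows shown at the top of this report. The column names `bound.sulfur.dioxide` and `free.above.50` are hypothetical and not part of the dataset, and the sketch assumes that the values in the dataset can be read directly as ppm.

```r
# Sulfur dioxide columns of the first six rows, copied from the head() output
so2 <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97)
)

# Bound SO2 is not in the dataset, but follows from total = free + bound
# (bound.sulfur.dioxide is a hypothetical, derived column)
so2$bound.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

# Flag samples above the 50 ppm threshold from the data description
# (assumption: the dataset values are comparable to ppm)
so2$free.above.50 <- so2$free.sulfur.dioxide > 50

print(so2)
```

None of these six samples exceeds the threshold, so the question about free SO2 above 50 ppm can only be answered on the full dataset.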
Although I will try to explore all relations between the variables in the
dataset, this initial reading did give me some more specific questions to
focus on as well. In summary:
* The main question: are there physicochemical variables in this dataset that
  are good predictors of the quality of Vinho Verde white wine, and if so,
  which variables?
* According to the literature, acids greatly contribute to the quality of wine.
  Is there a relationship between acidity and wine quality?
* When the level of volatile acids is too high, it negatively affects taste.
  Is there a negative relation between volatile acidity and quality?
* According to wine making sources, the lower the pH, the higher the acidity
  and vice versa. Can we see this (negative) relation in this dataset as well?
* The optimal alcohol percentage for Vinho Verde wines is between 8% and 11.5%.
  Can we see that wines with an alcohol percentage outside this range have lower
  quality ratings?
* The taste of wine is impacted by free sulfur dioxide levels higher than 50 ppm.
  Is there a negative relation between total sulfur dioxide level and quality?
There are two observations about the dataset that matter if we were to build
a model:
* The dataset is highly imbalanced: there are many more average wines than
  excellent or poor ones.
* Some of the variables in the dataset are interrelated. For example: density
  is related to alcohol and residual sugar, pH is related to the variables for
  acidity, and sulphates is related to the sulfur dioxide variables.
Although the data description states that there are no missing values,
I first checked whether this was true. Indeed, no missing values were found.
## X fixed.acidity volatile.acidity
## 0 0 0
## citric.acid residual.sugar chlorides
## 0 0 0
## free.sulfur.dioxide total.sulfur.dioxide density
## 0 0 0
## pH sulphates alcohol
## 0 0 0
## quality
## 0
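The missing-value check above could be reproduced along the following lines. This is a minimal sketch on a hypothetical toy data frame (`toy`, with one value deliberately set to `NA`), not the code actually used for this report.

```r
# Toy data frame with one deliberately missing value (hypothetical data)
toy <- data.frame(
  pH      = c(3.00, 3.30, NA, 3.19),
  alcohol = c(8.8, 9.5, 10.1, 9.9)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only if the whole data frame is complete
no_missing <- !anyNA(toy)
print(no_missing)
```

For the wine dataset, every count is zero, which matches the output above.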
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
After inspecting the structure and the first six rows of the dataset, there are
three changes I would like to make:
* Leave out the column “X”, which contains the row numbers and serves just as
  an identifier. This column is not useful for the analysis.
* Although the variable quality is stored as an integer, it is in fact a
  categorical variable. Therefore, I transform it to a factor variable
  (quality.f). I keep the integer version of the variable quality as well.
* Seven levels for the factor variable quality is quite a high number, so I
  create an additional factor variable quality.bin that groups quality ratings
  together: ratings 3 and 4 are poor quality, ratings 5, 6 and 7 are average and
  ratings 8 and 9 are good quality wines. By using this variable, I might be
  able to see clearer patterns than with seven levels.
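The two factor variables described above could be created roughly as follows. This sketch rebuilds a stand-in for the quality column from the rating counts reported later in this report. Using ordered factors is an assumption on my part, and the original code may differ.

```r
# Stand-in for the quality column, rebuilt from the rating counts in this
# report (20 wines rated 3, 163 rated 4, and so on)
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Factor that keeps all seven original ratings (ordered: an assumption)
quality.f <- factor(quality, levels = 3:9, ordered = TRUE)

# Three bins: ratings 3-4 poor, 5-7 average, 8-9 good.
# cut() uses right-closed intervals: (2,4], (4,7], (7,9]
quality.bin <- cut(quality,
                   breaks = c(2, 4, 7, 9),
                   labels = c("poor", "average", "good"),
                   ordered_result = TRUE)

print(table(quality.f))
print(table(quality.bin))
```

The bin counts that result (183 poor, 4535 average, 180 good) agree with the summary output in this report.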
First, I want to get a feel for the distribution of values in the dataset.
To do this, I run a summary on all variables.
## fixed.acidity volatile.acidity citric.acid pH
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. :2.720
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.:3.090
## Median : 6.800 Median :0.2600 Median :0.3200 Median :3.180
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean :3.188
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.:3.280
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :3.820
##
## alcohol residual.sugar density chlorides
## Min. : 8.00 Min. : 0.600 Min. :0.9871 Min. :0.00900
## 1st Qu.: 9.50 1st Qu.: 1.700 1st Qu.:0.9917 1st Qu.:0.03600
## Median :10.40 Median : 5.200 Median :0.9937 Median :0.04300
## Mean :10.51 Mean : 6.391 Mean :0.9940 Mean :0.04577
## 3rd Qu.:11.40 3rd Qu.: 9.900 3rd Qu.:0.9961 3rd Qu.:0.05000
## Max. :14.20 Max. :65.800 Max. :1.0390 Max. :0.34600
##
## free.sulfur.dioxide total.sulfur.dioxide sulphates quality
## Min. : 2.00 Min. : 9.0 Min. :0.2200 Min. :3.000
## 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.4100 1st Qu.:5.000
## Median : 34.00 Median :134.0 Median :0.4700 Median :6.000
## Mean : 35.31 Mean :138.4 Mean :0.4898 Mean :5.878
## 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.5500 3rd Qu.:6.000
## Max. :289.00 Max. :440.0 Max. :1.0800 Max. :9.000
##
## quality.f quality.bin
## 3: 20 poor : 183
## 4: 163 average:4535
## 5:1457 good : 180
## 6:2198
## 7: 880
## 8: 175
## 9: 5
Two things stand out when looking at the summaries:
* For most of the variables, there are extreme outliers in the dataset. With
  Tukey’s definition of extreme outliers (values above Q3 + 3 * IQR) in mind,
  you can see (without doing any detailed calculations) that extreme outliers
  are present for fixed acidity, volatile acidity, citric acid, residual sugar,
  chlorides, free sulfur dioxide, total sulfur dioxide, density and sulphates.
  Partly because of these outliers, all physicochemical variables are positively
  skewed, which results in a median that is lower than the mean.
* The wine experts did not give quality ratings lower than 3 and never gave a
  rating of 10. Also, the imbalance in quality ratings is clearly visible: there
  are only 5 samples with a quality rating of 9 and only 20 samples with a
  quality rating of 3, compared to (for example) almost 2200 samples with
  quality rating 6.
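The check for extreme outliers can also be done explicitly. The sketch below applies Tukey's upper fence (Q3 + 3 * IQR) to the quartiles and maxima copied from the summary output above. The vector names (`q1`, `q3`, `mx`, `upper_fence`, `has_extreme`) are my own, and only the upper side of each distribution is checked.

```r
# First quartiles, copied from the summary output (names identify the variables)
q1 <- c(fixed.acidity = 6.3, volatile.acidity = 0.21, citric.acid = 0.27,
        pH = 3.09, alcohol = 9.5, residual.sugar = 1.7, density = 0.9917,
        chlorides = 0.036, free.sulfur.dioxide = 23,
        total.sulfur.dioxide = 108, sulphates = 0.41)

# Third quartiles and maxima, in the same variable order
q3 <- c(7.3, 0.32, 0.39, 3.28, 11.4, 9.9, 0.9961, 0.050, 46, 167, 0.55)
mx <- c(14.2, 1.10, 1.66, 3.82, 14.2, 65.8, 1.0390, 0.346, 289, 440, 1.08)

# Upper fence for extreme outliers: Q3 + 3 * IQR
upper_fence <- q3 + 3 * (q3 - q1)

# TRUE where the maximum lies beyond the fence
has_extreme <- mx > upper_fence
print(has_extreme)
```

Only pH and alcohol stay within the fence, which agrees with the list of variables given above.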
Plotting histograms of all the variables confirms the observations made from
the variable summaries: there are some extreme outliers (sometimes not even
visible in these small plots but recognizable from the stretched out x-axes).
The distribution for sulphates seems to be bimodal. Apart from the extreme
outliers, the distributions for most variables seem more or less normal.
In order to reduce skewness of the distributions, I try to use a log scale for
each of the variables. Unfortunately, for most of the variables the skewness
remains. Instead I try to set sensible binwidths and limits to the x-axis.
After some experimenting, I feel that these plots capture the distributions of
the bulk of the values best. I first tried to leave out the highest 5% of
observations but the number of rows dropped would be too high in my opinion.
When taking the lowest 97.5%, the extreme outliers are ignored in the plot
which leads to cleaner pictures. I will use these settings in the
multivariate analyses as well.
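Limiting a plot to the lowest 97.5% of the values comes down to using the 97.5th percentile as the upper axis limit. The following is a minimal sketch on synthetic, right-skewed data. The vector `x` is hypothetical and only stands in for a variable such as chlorides.

```r
set.seed(42)

# Synthetic right-skewed values (hypothetical stand-in for a wine variable)
x <- rlnorm(1000, meanlog = -3, sdlog = 0.4)

# Upper x-axis limit: the 97.5th percentile
upper <- quantile(x, probs = 0.975)

# Values that remain visible once the axis is limited
x_shown <- x[x <= upper]

print(upper)
print(length(x_shown))
```

The value of `upper` can then be passed as the upper x-axis limit of a histogram, so that the extreme values are left out of the picture but stay in the data.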
There are two factor variables in the dataset: quality.f (a variable that
contains all original quality ratings from 3 to 9) and quality.bin (a variable
that contains combined levels from the original variable). Both variables will
be used during analysis. Sometimes using the bins can reveal a bigger picture
or trend that cannot be seen when all levels are present. On the other hand,
the original levels show more detailed trends than the combined levels can.
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## poor average good
## 183 4535 180
The bar charts for both variables show that binning does not solve the
problem of the highly imbalanced dataset. However, combining the levels does
create larger groups for poor and high quality wines, which allows us to make
statements about these two groups (where the groups of 20 samples for quality
rating 3 and 5 samples for quality rating 9 were too small).
The most important finding in the univariate analysis is that there are extreme
outliers for almost all physicochemical variables. At this point in the
analysis, it is not yet clear if there is a pattern in the extreme outliers
(for example: are extreme outliers related to low or just high quality?),
but it will certainly be interesting to do some analyses with and without
outliers.
A few observations from the boxplots:
Therefore, I create the boxplots again but use the quality bins instead, which
leads to slightly larger groups for poor (rating 3 and 4) and good (ratings 8
and 9) quality wines.
Please note that I did not limit or transform the variables. The boxplots
were meant to show where the outliers are for each variable.
The boxplots created with the quality bins show some more information:
Next, I will run some ggpairs plots to see some possible relations between
variables in one glance. Unfortunately, including all variables in the plot
results in a plot that is unreadable, so I decided to group variables.
As with the boxplots above, I did not limit or transform the variables in the
plots below. Since I have no reason to believe that the
outliers are bad measurements, they should be included when calculating and
visualizing relationships. The outliers are removed only in the multivariate
plots to make the plots more interpretable.
## quality.bin mean
## 1 poor 0.3759836
## 2 average 0.2743076
## 3 good 0.2779722
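A table of group means like the one above could be produced as follows. The values in this sketch are hypothetical toy data, not taken from the dataset, and the original code may have used a different function than `aggregate`.

```r
# Toy data (hypothetical values, not taken from the dataset)
toy <- data.frame(
  quality.bin = factor(c("poor", "poor", "average", "average", "good", "good"),
                       levels = c("poor", "average", "good")),
  volatile.acidity = c(0.40, 0.36, 0.27, 0.29, 0.26, 0.30)
)

# Mean volatile acidity per quality bin
group_means <- aggregate(volatile.acidity ~ quality.bin, data = toy, FUN = mean)
names(group_means)[2] <- "mean"
print(group_means)
```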
To see all relations in one table, I create a correlation matrix:
## fixed.acidity volatile.acidity citric.acid pH
## fixed.acidity 1.00 -0.02 0.29 -0.43
## volatile.acidity -0.02 1.00 -0.15 -0.03
## citric.acid 0.29 -0.15 1.00 -0.16
## pH -0.43 -0.03 -0.16 1.00
## alcohol -0.12 0.07 -0.08 0.12
## residual.sugar 0.09 0.06 0.09 -0.19
## density 0.27 0.03 0.15 -0.09
## chlorides 0.02 0.07 0.11 -0.09
## free.sulfur.dioxide -0.05 -0.10 0.09 0.00
## total.sulfur.dioxide 0.09 0.09 0.12 0.00
## sulphates -0.02 -0.04 0.06 0.16
## quality -0.11 -0.19 -0.01 0.10
## alcohol residual.sugar density chlorides
## fixed.acidity -0.12 0.09 0.27 0.02
## volatile.acidity 0.07 0.06 0.03 0.07
## citric.acid -0.08 0.09 0.15 0.11
## pH 0.12 -0.19 -0.09 -0.09
## alcohol 1.00 -0.45 -0.78 -0.36
## residual.sugar -0.45 1.00 0.84 0.09
## density -0.78 0.84 1.00 0.26
## chlorides -0.36 0.09 0.26 1.00
## free.sulfur.dioxide -0.25 0.30 0.29 0.10
## total.sulfur.dioxide -0.45 0.40 0.53 0.20
## sulphates -0.02 -0.03 0.07 0.02
## quality 0.44 -0.10 -0.31 -0.21
## free.sulfur.dioxide total.sulfur.dioxide sulphates
## fixed.acidity -0.05 0.09 -0.02
## volatile.acidity -0.10 0.09 -0.04
## citric.acid 0.09 0.12 0.06
## pH 0.00 0.00 0.16
## alcohol -0.25 -0.45 -0.02
## residual.sugar 0.30 0.40 -0.03
## density 0.29 0.53 0.07
## chlorides 0.10 0.20 0.02
## free.sulfur.dioxide 1.00 0.62 0.06
## total.sulfur.dioxide 0.62 1.00 0.13
## sulphates 0.06 0.13 1.00
## quality 0.01 -0.17 0.05
## quality
## fixed.acidity -0.11
## volatile.acidity -0.19
## citric.acid -0.01
## pH 0.10
## alcohol 0.44
## residual.sugar -0.10
## density -0.31
## chlorides -0.21
## free.sulfur.dioxide 0.01
## total.sulfur.dioxide -0.17
## sulphates 0.05
## quality 1.00
In this matrix, I can see the following moderate (0.4-0.59), strong (0.6-0.79)
and very strong (0.8 or higher) correlations:
* fixed acidity and pH (-0.43)
* alcohol and residual sugar (-0.45)
* alcohol and density (-0.78)
* alcohol and total sulfur dioxide (-0.45)
* density and residual sugar (0.84)
* density and total sulfur dioxide (0.53)
* residual sugar and total sulfur dioxide (0.40)
* free and total sulfur dioxide (0.62)
In a multivariate analysis, it will be especially interesting to explore two
variables that are not interrelated like alcohol or residual sugar with
total sulfur dioxide, grouped by quality rating.
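A correlation matrix of this kind can be computed with `cor`. The sketch below uses three columns of the first six rows shown at the top of this report, so its values will differ from the matrix above. Judging by the values for quality, the matrix above appears to use Pearson's coefficient, which is the default for `cor`; this is an inference, not something stated in the report.

```r
# Three columns of the first six rows, copied from the head() output
ww_head <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951)
)

# Pairwise Pearson correlations, rounded to two decimals
cor_mat <- round(cor(ww_head, method = "pearson"), 2)
print(cor_mat)
```

Even in these six rows the signs agree with the full dataset: density rises with residual sugar and falls with alcohol.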
To find correlation between the (continuous) physicochemical variables and the
(ordinal) variable quality, I use Spearman’s correlation:
##
## Spearman's rank correlation rho
##
## data: ww$fixed.acidity and ww$quality
## S = 2.1239e+10, p-value = 3.183e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.08448545
##
## Spearman's rank correlation rho
##
## data: ww$volatile.acidity and ww$quality
## S = 2.3434e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1965617
##
## Spearman's rank correlation rho
##
## data: ww$citric.acid and ww$quality
## S = 1.9225e+10, p-value = 0.1996
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.01833273
##
## Spearman's rank correlation rho
##
## data: ww$pH and ww$quality
## S = 1.7442e+10, p-value = 1.656e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1093621
##
## Spearman's rank correlation rho
##
## data: ww$alcohol and ww$quality
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4403692
##
## Spearman's rank correlation rho
##
## data: ww$residual.sugar and ww$quality
## S = 2.1191e+10, p-value = 8.822e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.08206979
##
## Spearman's rank correlation rho
##
## data: ww$density and ww$quality
## S = 2.6406e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.348351
##
## Spearman's rank correlation rho
##
## data: ww$chlorides and ww$quality
## S = 2.5743e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.3144885
##
## Spearman's rank correlation rho
##
## data: ww$free.sulfur.dioxide and ww$quality
## S = 1.912e+10, p-value = 0.09703
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.02371338
##
## Spearman's rank correlation rho
##
## data: ww$total.sulfur.dioxide and ww$quality
## S = 2.3436e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1966803
##
## Spearman's rank correlation rho
##
## data: ww$sulphates and ww$quality
## S = 1.8932e+10, p-value = 0.01971
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.03331897
For all physicochemical variables except citric acid and free sulfur dioxide,
we can reject the null hypothesis that there is no association between the
variables and quality rating. However, none of those correlations are strong:
there is a moderate positive relation between alcohol and quality, and there
are weak negative relations between density and quality and between chlorides
and quality.
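The series of Spearman tests above could be run in one go, as in the sketch below. It uses synthetic data in which quality is loosely tied to alcohol and unrelated to chlorides. The data frame `toy` and the numbers in it are hypothetical and do not reproduce the results above.

```r
set.seed(1)

# Synthetic stand-in for the dataset (hypothetical values)
n <- 200
alcohol <- runif(n, min = 8, max = 14)
quality <- pmin(9, pmax(3, round(0.5 * alcohol + rnorm(n))))
toy <- data.frame(alcohol   = alcohol,
                  chlorides = runif(n, min = 0.01, max = 0.1),
                  quality   = quality)

# Spearman's rank correlation of each predictor with quality.
# exact = FALSE requests the asymptotic p-value, which avoids the warning
# that R gives when the data contain ties (ordinal ratings always do).
predictors <- setdiff(names(toy), "quality")
results <- t(sapply(predictors, function(v) {
  test <- cor.test(toy[[v]], toy$quality, method = "spearman", exact = FALSE)
  c(rho = unname(test$estimate), p.value = test$p.value)
}))
print(round(results, 4))
```

Collecting rho and the p-value in one table makes it easier to compare the variables than reading the separate test outputs.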
##
## Spearman's rank correlation rho
##
## data: ca90$citric.acid and ca90$quality
## S = 1.2307e+10, p-value = 8.783e-10
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.09294089
To get a feeling for the impact of the extreme outliers on the strength of
the relationship between variables, I decided to pick one and visualize the
linear relationship with and without outliers. Of course, a scatterplot is not
the best choice for visualizing a categorical variable but it was just to get an
idea.
As it turns out, removing outliers can turn a non-significant relationship into
a significant one: see the relation between citric acid and quality above.
However, doing this would only be legitimate if the outliers in the dataset
were an indication of faulty data, and I have no reason to believe the outliers
are coming from bad data. Wine making is a natural process that apparently
sometimes results in more extreme values for the chemical properties.
When looking at relations between the chemical variables in the dataset, we can
see that the only strong relationships are those between density and alcohol
(negative) and between density and residual sugar (positive). Both alcohol and
residual sugar largely determine the density of wine.
Other relationships between chemical values are only moderately strong, such as
the relations between alcohol and total sulfur dioxide, density and total sulfur
dioxide and residual sugar and total sulfur dioxide.
For all physicochemical variables except citric acid and free sulfur dioxide,
we can reject the null hypothesis that there is no association between the
variables and quality rating. However, none of those correlations are strong:
there is a moderate positive relation between alcohol and quality, and there
are weak negative relations between density and quality and between chlorides
and quality.
In this section, I will explore the relationships found in the previous
section by adding a third variable.
The plot above shows the relation between alcohol and total sulfur dioxide. The
shape of the cloud matches the moderately strong negative correlation (-0.45).
Also, the relation between alcohol and quality is visible: the blue spots for
higher quality tend to be on the right side of the plot (higher alcohol
percentage).
However, the blue spots are spread out quite evenly over the vertical axis,
which means that there is not a strong relation between quality and total
sulfur dioxide.
For all plots in this section I leave out the highest 1% values to improve
readability of the plot.
This plot does show a relation between total sulfur dioxide and residual sugar
(the shape of the cloud moves in a slightly upward direction for the higher
values of both sulfur dioxide and residual sugar), but there does not seem to
be a relation between quality and either chemical variable. The blue and
red/orange spots for poor and good quality wines seem to be spread over both
the x- and y-axis quite evenly.
Density and residual sugar have the strongest correlation and it shows in the
plot. Also it is clear that higher quality wines tend to be on the lower side
of the cloud, meaning that high quality wines tend to have a lower density.
This was confirmed by the (rather weak) correlation coefficient of -0.35 for
density and quality.
Running the same plot but with the quality bins confirms the picture from the
previous plot: the good quality wines (green dots) tend to have a lower density,
whereas the poor quality wines (orange dots) tend to have a higher density.
Alcohol and density are negatively correlated (meaning: the higher the alcohol
percentage, the lower the density) but the same relation as found before is also
visible in this plot: higher quality wines tend to have higher alcohol
percentages.
As we are predicting the outcome of an ordinal categorical variable, an ordinary
linear model is not appropriate. Instead, ordinal logistic regression would be
a good choice for building a model for this dataset. However, I don’t have
experience with this regression method and therefore I am not certain that I
would interpret the results correctly. I will create such a model at a later
stage.
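To give an idea of what such a model could look like, here is a minimal sketch of an ordinal logistic (proportional odds) regression with `polr` from the MASS package. It is fitted on synthetic data in which quality is driven by alcohol. The data, the choice of `polr` and the single predictor are all assumptions, so this is not the model for this dataset.

```r
library(MASS)  # provides polr(); MASS is a recommended package shipped with R

set.seed(7)

# Synthetic data: a latent score driven by alcohol, cut into three ordered bins
# (hypothetical values, roughly 20% poor, 60% average, 20% good)
n <- 500
alcohol <- runif(n, min = 8, max = 14)
latent  <- 0.8 * alcohol + rlogis(n)
quality.bin <- cut(latent,
                   breaks = quantile(latent, c(0, 0.2, 0.8, 1)),
                   labels = c("poor", "average", "good"),
                   include.lowest = TRUE, ordered_result = TRUE)
toy <- data.frame(alcohol = alcohol, quality.bin = quality.bin)

# Proportional-odds (ordinal logistic) regression; the response must be
# an ordered factor
fit <- polr(quality.bin ~ alcohol, data = toy, Hess = TRUE)

# A positive coefficient means that higher alcohol shifts the odds
# towards the better quality bins
print(coef(fit))
```

On the real data, the imbalance between the quality groups and the interrelated predictors mentioned earlier would need attention before the coefficients can be interpreted.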
The multivariate plots confirmed the relations found in the bivariate analysis.
The relationship between alcohol percentage and quality stands out: the variable
alcohol can be used as a predictor for the quality of wine. Also, the plot
for density versus residual sugar clearly shows that the higher quality wines
are on the lower side of the scattercloud, meaning that higher quality wines
tend to have lower levels of residual sugar and therefore lower density.
During univariate analysis, this plot gave a quick overview of the distribution
of all numeric variables in the dataset. It showed the presence of extreme
outliers, as can be seen in the distribution for chlorides. By choosing a very
small binwidth we are able to see the very long right tail of the distribution.
## 75%
## 21.14286
In fact, the maximum value is 21.14 times the interquartile range higher than
the value of the 3rd quartile, where 3 times the interquartile range above the
3rd quartile can already be seen as an extreme outlier (according to Tukey).
The variable chlorides is an extreme example, but there are more variables in
the dataset for which limits were required in order to make relations better
visible.
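The figure of 21.14 can be checked by hand from the summary values for chlorides (1st quartile 0.036, 3rd quartile 0.050, maximum 0.346). The variable names in this sketch are my own.

```r
# Chlorides quartiles and maximum, copied from the summary output
q1        <- 0.036
q3        <- 0.050
max_value <- 0.346

# Distance of the maximum above Q3, expressed in interquartile ranges
iqr_multiples <- (max_value - q3) / (q3 - q1)
print(iqr_multiples)
```

The result, 0.296 / 0.014, is about 21.14, far beyond the factor of 3 that marks an extreme outlier.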
##
## Spearman's rank correlation rho
##
## data: ww$alcohol and ww$quality
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4403692
This plot is lifted out of the ggpairs plot in the bivariate section and shows
the relation between alcohol and quality. Please note that “density” in this
plot does not refer to the variable density but to the frequency density.
This plot clearly shows that the distributions for the poor and
average wines differ from the distribution for good wines. Also, the mean
alcohol percentage for good wines is higher than for the other two groups.
By choosing to use the quality bins instead of all levels of the variable
quality, we can make the poor and good quality wines stand out. The image confirms
that density and residual sugar have a strong correlation. Also it is clear
that higher quality wines tend to be on the lower side of the cloud, meaning
that high quality wines tend to have a lower density. This was confirmed by
the (rather weak) correlation coefficient of -0.35 for density and quality.
It is now time to answer the questions stated in the first part of the report:
* Predictors of quality: there is only one variable (alcohol) that has a
  moderately strong positive relationship with the quality of white wine. This
  variable can be used to predict wine quality. With the exception of the
  variables citric acid and free sulfur dioxide, all other relations are weak
  but significant.
* Acidity and quality: the correlation test for fixed acidity showed a
  significant but very weak negative relationship with quality (rho = -0.08).
  For citric acid, the relationship was not significant on the full dataset
  (p = 0.20) and only became significant, while still very weak (rho = 0.09),
  in the second test on the ca90 subset.
* Volatile acidity and quality: volatile acidity is considered a fault when it
  exceeds 1.2 g/dm3. None of the observations in the dataset exceeds this value
  (the maximum is 1.1). However, a comparison of the means of the quality groups
  shows that poor quality wines have on average a higher level of volatile
  acidity (0.38, compared to 0.27 for average and 0.28 for good wines).
* pH and acidity: this relationship is demonstrated most clearly for the
  variables fixed acidity and pH. Indeed, there is a moderately strong negative
  relation (-0.43).
* Optimal alcohol range: plot 2 in the summary above shows rather the contrary.
  Higher quality wines tend to have higher levels of alcohol, with an average
  alcohol level for good quality wine of more than 11.5%.
* Sulfur dioxide and quality: there is a negative relation between total sulfur
  dioxide and quality, but it is fairly weak with a correlation coefficient
  of -0.2.
The first and arguably most important lesson that I learned when I started out
exploring this dataset about wine is that although in theory it is possible to
explore and visualize a well-documented dataset, in practice you will always
need some domain knowledge as a data analyst.
Second, courses often use datasets that are known to have strong relationships
between the variables in them. This dataset taught me that this is not always
the case and that weak but significant relations between variables are also
relations.
This was just a first exploration. Further work can be done by using ordered
logistic regression to predict the quality of white wine. Once we have a model
we could see if this model also works for other types of white wine.
There is also a dataset available for the red variant of Vinho Verde wines, so
the two datasets could be compared.